Information Distance in Multiples 

Paul M.B. Vitanyi 



Abstract 

Information distance is a parameter-free similarity measure based on compression, used in pattern 
recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is 
extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal 
overlap, additivity, and normalized information distance in multiples. We use the theoretical notion of 
Kolmogorov complexity which for practical purposes is approximated by the length of the compressed 
version of the file involved, using a real-world compression program. 

Index Terms — Information distance, multiples, pattern recognition, data mining, similarity, Kol- 
mogorov complexity 

I. Introduction 

In pattern recognition, learning, and data mining one obtains information from objects containing 
information. This involves an objective definition of the information in a single object, the information 
to go from one object to another object in a pair of objects, the information to go from one object to 
any other object in a multiple of objects, and the shared information between objects [34]. 

The classical notion of Kolmogorov complexity [21] is an objective measure for the information in 
a single object, and information distance measures the information between a pair of objects [3]. This last 
notion has spawned research in the theoretical direction, among others [6], [37], [38], [39], [30], [35]. 
Research in the practical direction has focused on the normalized information distance, the similarity 
metric, which arises by normalizing the information distance in a proper manner and approximating the 
Kolmogorov complexity through real-world compressors [26], [7], [8], [9]. This normalized information 
distance is a parameter-free, feature-free, and alignment-free similarity measure that has had great impact 
in applications. A variant of this compression distance has been tested on all time sequence databases 
used in the last decade in the major data mining conferences (SIGKDD, SIGMOD, ICDM, ICDE, SSDB, VLDB, 
PKDD, PAKDD) [18]. The conclusion is that the method is competitive with all 51 other methods used and 
superior in heterogeneous data clustering and anomaly detection. In [4] it was shown that the method is 
resistant to noise. This theory has found many applications in pattern recognition, phylogeny, clustering, 
and classification. For objects that are represented as computer files such applications range from weather 
forecasting, software, earthquake prediction, music, literature, OCR, and bioinformatics to the internet [1], [2], [5], 
[10], [8], [9], [12], [19], [20], [23], [22], [25], [33], [31], [32], [40]. For objects that are only represented 
by name, or objects that are abstract like 'red,' 'Einstein,' 'three,' the normalized information distance 
uses background information provided by Google, or any search engine that produces aggregate page 
counts. It discovers the 'meaning' of words and phrases in the sense of producing a relative semantics. 
Applications run from ontology, semantics, tourism on the web, taxonomy, multilingual questions, to 
question-answer systems [15], [16], [42], [36], [41], [43], [17], [13], [14]. For more references on either 
subject see the textbook [28] or Google Scholar for references to [26], [8], [9]. 

However, in many applications we are interested in shared information between many objects instead 
of just a pair of objects. For example, in customer reviews of gadgets, in blogs about public happenings, 
in newspaper articles about the same occurrence, we are interested in the most comprehensive one or the 
most specialized one. Thus, we want to extend the information distance measure from pairs to multiples. 

A. Related Work 

In [27] the notion is introduced of the information required to go from any object in a multiple of 
objects to any other object in the multiple. This is applied to extracting the essence from, for example, a 
finite list of internet news items, reviews of electronic cameras, TVs, and so on, in a way that works better 
than other methods. Let $X$ denote a finite list of $m$ finite binary strings defined by $X = (x_1, \ldots, x_m)$, 
the constituting strings ordered length-increasing lexicographic. We use lists and not sets, since if $X$ is a 
set we cannot express simply the distance from a string to itself or between strings that are all equal. Let 
$U$ be the reference universal Turing machine, for convenience the prefix one as in Section II. Given the 
string $x_i$ we define the information distance to any string in $X$ by 
$E_{\max}(X) = \min\{|p| : U(x_i, p, j) = x_j \text{ for all } x_i, x_j \in X\}$. It is shown in [27], Theorem 2, that 

$$E_{\max}(X) = \max_{x : x \in X} K(X|x), \tag{I.1}$$

up to a logarithmic additive term. Define $E_{\min}(X) = \min_{x : x \in X} K(X|x)$. Theorem 3 in [27] states that 
for every list $X = (x_1, \ldots, x_m)$ we have 

$$E_{\min}(X) \le E_{\max}(X) \le \min_{i : 1 \le i \le m} \sum_{x_i, x_k \in X \,\&\, k \ne i} E_{\max}(x_i, x_k), \tag{I.2}$$

up to a logarithmic additive term. This is not a corollary of (I.1) as stated in [27], but both inequalities 
follow from the definitions. The lefthand side is interpreted as the program length of the "most 
comprehensive object that contains the most information about all the others [all elements of X]," and 
the righthand side is interpreted as the program length of the "most specialized object that is similar to 
all the others." The paper [27] develops the stated results and applications. It does not develop the theory 
in any detail. That is the purpose of the present paper. 
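
The quantities $E_{\max}(X)$ and $E_{\min}(X)$ are not computable, but a rough numerical feel for them can be had by replacing Kolmogorov complexity with compressed length, as the abstract suggests. The Python fragment below is only a sketch under stated assumptions and is not the procedure of [27]: it assumes zlib as the compressor, encodes a list by plain concatenation, approximates $K(X|x)$ by $C(xX) - C(x)$ with $C$ the compressed length, and all function names are hypothetical.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for K."""
    return len(zlib.compress(data, 9))

def cond(items, x: bytes) -> int:
    """Heuristic for K(X|x): C(x followed by X) minus C(x), floored at 0."""
    joined = b"".join(items)
    return max(c(x + joined) - c(x), 0)

def e_max(items) -> int:
    """Heuristic E_max(X): the largest K(X|x) over the elements x of X."""
    return max(cond(items, x) for x in items)

def e_min(items) -> int:
    """Heuristic E_min(X): the smallest K(X|x) over the elements x of X."""
    return min(cond(items, x) for x in items)

# Hypothetical toy reviews, invented for this illustration.
reviews = [
    b"The camera is light and the battery lasts all day.",
    b"The camera is light, the battery lasts all day, and the lens is sharp.",
    b"Shipping was slow.",
]
print(e_min(reviews), e_max(reviews))
```

The element attaining the minimum plays the role of the "most comprehensive" item in this heuristic reading. On inputs this short the compressor overhead dominates, so the numbers are indicative at best.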

B. Results 

Information distance for multiples, that is, finite lists, appears both practically and theoretically 
promising. In all cases below the results imply the corresponding ones for the pairwise information 
distance defined as follows. The information distance in [3] between strings $x_1$ and $x_2$ is 
$E_{\max}(x_1, x_2) = \max\{K(x_1|x_2), K(x_2|x_1)\}$. In the current paper $E_{\max}(X) = \max_{x : x \in X} K(X|x)$. These two definitions 
coincide for $|X| = 2$ since $K(x, y|x) = K(y|x)$ up to an additive constant term. We investigate the 
maximal overlap of information (Theorem 3.1), which for $|X| = 2$ specializes to Theorem 3.4 in [3]; 
Corollary 3.2 shows (I.1) and Corollary 3.3 shows that the lefthand side of (I.2) can be taken to correspond 
to a single program embodying the "most comprehensive object that contains the most information about 
all the others" as stated but not argued or proved in [27]; metricity (Theorem 4.1) and universality 
(Theorem 5.2), which for $|XY| = 2$ (for metricity) and $|X| = 2$ (for universality) specialize to Theorem 
4.2 in [3]; additivity (Theorem 6.1); minimum overlap of information (Theorem 7.1), which for $|X| = 2$ 
specializes to Theorem 8.3.7 in [29]; and the nonmetricity of normalized information distance for lists of 
more than two elements and certain proposals of the normalizing factor (Section VIII). In contrast, for 
lists of two elements we can normalize the information distance as in Lemma V.4 and Theorem V.7 of 
[26]. The definitions are of necessity new as are the proof ideas. Remarkably, the new notation and proofs 
for the general case are simpler than the mentioned existing proofs for the particular case of pairwise 
information distance. 

II. Preliminaries 

Kolmogorov complexity: This is the information in a single object [21]. The notion has been the 
subject of a plethora of papers. Informally, the Kolmogorov complexity of a finite binary string is the 
length of the shortest string from which the original can be losslessly reconstructed by an effective 
general-purpose computer such as a particular universal Turing machine. Hence it constitutes a lower 
bound on how far a lossless compression program can compress. For technical reasons we choose Turing 
machines with a separate read-only input tape, that is scanned from left to right without backing up, a 
separate work tape on which the computation takes place, and a separate output tape. Upon halting, the 
initial segment p of the input that has been scanned is called the input "program" and the contents of 
the output tape is called the "output." By construction, the set of halting programs is prefix free. We 
call U the reference universal prefix Turing machine. This leads to the definition of "prefix Kolmogorov 
complexity" which we shall designate simply as "Kolmogorov complexity." 

Formally, the conditional Kolmogorov complexity $K(x|y)$ is the length of the shortest input $z$ such that 
the reference universal prefix Turing machine $U$ on input $z$ with auxiliary information $y$ outputs $x$. The 
unconditional Kolmogorov complexity $K(x)$ is defined by $K(x|\epsilon)$ where $\epsilon$ is the empty string (of length 
0). In these definitions both $x$ and $y$ can consist of nonempty finite lists of finite binary strings. For 
more details and theorems that are used in the present work see Appendix IX. 

Lists: A list is a multiple $X = (x_1, \ldots, x_m)$ of $m < \infty$ finite binary strings in length-increasing 
lexicographic order. If $X$ is a list, then some or all of its elements may be equal. Thus, a list is not a 
set but an ordered bag of elements. With some abuse of the common set-membership notation we write 
$x_i \in X$ for every $i$ ($0 < i \le m$) to mean that "$x_i$ is an element of list $X$." The conditional prefix 
Kolmogorov complexity $K(X|x)$ of a list $X$ given an element $x$ is the length of a shortest program $p$ 
for the reference universal Turing machine that with input $x$ outputs the list $X$. The prefix Kolmogorov 
complexity $K(X)$ of a list $X$ is defined by $K(X|\epsilon)$. One can also put lists in the conditional such as 
$K(x|X)$ or $K(X|Y)$. We will use the straightforward laws $K(\cdot|X, x) = K(\cdot|X)$ and $K(X|x) = K(X'|x)$ 
up to an additive constant term, for $x \in X$ and $X'$ equal to the list $X$ with the element $x$ deleted. 
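
The ordering convention for lists can be made concrete with a few lines of Python. This is a minimal sketch in which binary strings are represented as Python `str` values over the characters 0 and 1; the helper name `as_list` is hypothetical. Duplicates are kept, which is exactly what distinguishes a list from a set here.

```python
def as_list(strings):
    """Order binary strings length-increasing lexicographic, keeping duplicates."""
    return tuple(sorted(strings, key=lambda s: (len(s), s)))

X = as_list(["10", "0", "", "1", "01", "0"])
print(X)  # ('', '0', '0', '1', '01', '10')
```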

Information Distance: To obtain the pairwise information distance in [3] we take $X = \{x_1, x_2\}$ in 
(I.1). Then (I.1) is equivalent to $E_{\max}(x_1, x_2) = \max\{K(x_1|x_2), K(x_2|x_1)\}$. 
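
Written out, the reduction uses only the law $K(X|x) = K(X'|x)$ stated under Lists, applied to a two-element list; the display below is a restatement for convenience, not an additional result:

```latex
\begin{align*}
E_{\max}(x_1, x_2)
  &= \max\{K(x_1, x_2 | x_1),\; K(x_1, x_2 | x_2)\} \\
  &= \max\{K(x_2 | x_1),\; K(x_1 | x_2)\} + O(1).
\end{align*}
```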

III. Maximal Overlap 

We use the notation and terminology of Section I-A. Define $k_1 = E_{\min}(X)$, $k_2 = E_{\max}(X)$, and 
$l = k_2 - k_1$. We prove a maximal overlap theorem: the information needed to go from any $x_i$ to any 
$x_k$ in $X$ can be divided in two parts: a single string of length $k_1$ and a string $r$ of length $l$ (possibly 
depending on $x_i$), everything up to an additive logarithmic term. 

Theorem 3.1: A single program of length $k_1 + K(m, k_1, k_2) + \log m + O(1)$ bits concatenated with a 
string of $l$ bits, possibly depending on $i$, suffice to find $X$ from $x_i$ for every $x_i \in X$. To find an arbitrary 
element $x_k \in X$ from $x_i$ it suffices to concatenate at most another $\log m$ bits, possibly depending on $i$ 
and $k$. 

Proof: Enumerate the finite binary strings lexicographic length-increasing as $s_1, s_2, \ldots$. Let $G = (V, E)$ 
be a graph defined as follows. Let $A$ be the set of finite binary strings and $B$ the set of vectors 
of strings in $A$ defined by $v = (s_1, \ldots, s_m)$ such that 

$$\min_{j : 1 \le j \le m} \{K(s_1, \ldots, s_m | s_j)\} \le k_1,$$

$$\max_{j : 1 \le j \le m} \{K(s_1, \ldots, s_m | s_j)\} \le k_2.$$

Given $k_1$ and $k_2$ the set $B$ can be enumerated. Define $V = A \cup B$. Define $E$ by length-increasing 
lexicographic enumerating $A \times B$ and put $(rs, v) \in E$ with $rs \in A$ and $v = (s_1, \ldots, s_m) \in B$ if $s = s_j$ 
for some $j$ ($1 \le j \le m$), where $r$ is chosen as follows. It is the $\lceil i/2^{k_1} \rceil$th string of length $l$ where $i$ is 
the number of times we have used $s \in A$. So the first $2^{k_1}$ times we choose an edge $(\cdot s, \cdot)$ we use $0^l$, the 
next $2^{k_1}$ we use $0^{l-1}1$, and so on. In this way, $i \le 2^{k_2}$ so that $i/2^{k_1} \le 2^l$. By adding $r$ to $s$ we take care 
that the degree of $rs$ is at most $2^{k_1}$ and not at most $2^{k_2}$ as it could be without the prefix $r$. The degree 
of a node $v \in B$ is trivially $m$. 

In addition, we enumerate $B$ length-increasing lexicographic and 'color' every one of the $m$ edges 
incident with an enumerated vector $v \in B$ with the same binary string $c$ of length $k_1 + \log m$. If 
$v = (s_1, \ldots, s_m)$ and $v$ is connected by edges to nodes $r_1 s_1, \ldots, r_m s_m$, then choose $c$ as the minimum 
color not yet appearing on any edge incident with any $r_j s_j$ ($1 \le j \le m$). Since the degree of every node 
$rs \in A$ is bounded by $2^{k_1}$ and hence the colors already used for edges incident on nodes $r_1 s_1, \ldots, r_m s_m$ 
number at most $\sum_{1 \le j \le m} (2^{k_1} - 1) = m 2^{k_1} - m$, a color is always available. 

Knowing $m, k_1, k_2$ one can reconstruct $G$ and color its edges. Given an element $x$ from the list $X$, and 
knowing the appropriate string $r$ of length $l$ and the color $c$ of the edge $(rx, X)$, we can find $X$. Hence 
a single program, say $p$, of length $k_1 + K(m, k_1, k_2) + \log m + O(1)$ bits suffices to find $X$ from $rx$ for 
any $x \in X$ and with $|r| = l$. An additional $\log m$ bits suffice to select any element of $X$. Taking these 
$\log m$ bits so that they encode the difference from $i$ to $k$ mod $m$ we can compute from every $x_i \in X$ 
to every $x_k \in X$ and vice versa with the same program $p$ of length $k_1 + K(m, k_1, k_2) + \log m + O(1)$ 
concatenated with a string $r$ of length $l$ and a string of length $\log m$, both possibly depending on $i$ and 
$k$. Since we know $m, k_1, k_2$ from the fixed program $p$, where they are encoded as a self-delimiting prefix 
of length $K(m, k_1, k_2)$ say, we can concatenate these strings without separation markers and reconstruct 
them. ■ 
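
The counting step in the coloring argument can be checked on a toy graph. The sketch below is an illustration under stated assumptions, not the construction of the proof itself: a small degree bound `d` stands in for $2^{k_1}$, the set $B$ is replaced by randomly generated $m$-tuples of distinct nodes of $A$, and the function names are hypothetical. Each node of $B$ receives one color for all its $m$ edges, namely the least color absent from every edge incident with its neighbors, and the largest color index used never exceeds $m(d-1)$.

```python
import random

def random_b_nodes(n_a, m, d, count, seed=0):
    """Toy stand-in for the enumeration of B: m-tuples of distinct A-nodes,
    generated so that no A-node ever gets degree above d."""
    rng = random.Random(seed)
    degree = [0] * n_a
    b_nodes = []
    for _ in range(count):
        free = [a for a in range(n_a) if degree[a] < d]
        if len(free) < m:
            break
        v = tuple(rng.sample(free, m))
        for a in v:
            degree[a] += 1
        b_nodes.append(v)
    return b_nodes

def greedy_colors(b_nodes):
    """Give each B-node one color for all its edges: the least color not
    yet present on any edge incident with one of its A-neighbors."""
    seen = {}  # A-node -> colors already on its incident edges
    colors = []
    for v in b_nodes:
        taken = set().union(*(seen.get(a, set()) for a in v))
        c = 0
        while c in taken:
            c += 1
        colors.append(c)
        for a in v:
            seen.setdefault(a, set()).add(c)
    return colors

m, d = 3, 4
b_nodes = random_b_nodes(n_a=10, m=m, d=d, count=50)
colors = greedy_colors(b_nodes)
print(max(colors), "<=", m * (d - 1))
```

Because the edges at any node of $A$ carry distinct colors, a node of $A$ together with a color identifies at most one node of $B$, which is what lets the program $p$ recover $X$ from $rx$ and $c$.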

Corollary 3.2: Since $k_1 + l = k_2$, the theorem implies (I.1), that is, Theorem 2 of [27]. 

It is not a priori clear that $E_{\min}(X)$ in the lefthand side of (I.2) corresponds to a single program that 
represents the information overlap of every shortest program going from any $x_i$ to the list $X$. This seems 
in fact assumed in [27] where $E_{\min}(X)$ is interpreted as the [Kolmogorov complexity of] "the most 
comprehensive object that contains the most information about all the others." In fact, for every $x_i \in X$ 
we can choose a shortest program going from $x_i$ to the list $X$ so that these programs have pairwise no 
information overlap at all (Theorem 7.1). But here we have proved: 

Corollary 3.3: The quantity $E_{\min}(X)$ corresponds to a single shortest program that represents the 
maximum overlap of information of all programs going from $x_i$ to the list $X$ for any $x_i \in X$. 

IV. Metricity 

We consider nonempty finite lists of finite binary strings, each list ordered length-increasing lexicographic. 
Let $\mathcal{X}$ be the set of such ordered nonempty finite lists of finite binary strings. A distance function 
$d$ on $\mathcal{X}$ is defined by $d : \mathcal{X} \to \mathcal{R}^+$ where $\mathcal{R}^+$ is the set of nonnegative real numbers. Define $W = UV$ 
if $W$ is a list of the elements of the lists $U$ and $V$ and the elements of $W$ are ordered length-increasing 
lexicographic. A distance function $d$ is a metric if for all $X, Y, Z \in \mathcal{X}$ it satisfies: 

1) Positive definiteness: $d(X) = 0$ if all elements of $X$ are equal and $d(X) > 0$ otherwise. 

2) Symmetry: $d(X)$ is invariant under all permutations of $X$. 

3) Triangle inequality: $d(XY) \le d(XZ) + d(ZY)$. 
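
To see what the three axioms demand of a function on lists, rather than on pairs, they can be checked exhaustively for a toy distance. The sketch below assumes a deliberately simple list distance, the number of distinct elements minus one; this choice is made purely for illustration and is not the information distance. The names `merge` and `d` are hypothetical.

```python
from itertools import product

def merge(u, v):
    """W = UV: the elements of both lists, ordered length-increasing lexicographic."""
    return tuple(sorted(u + v, key=lambda s: (len(s), s)))

def d(x):
    """Toy list distance: number of distinct elements minus one (an assumption
    made for illustration; it is not the information distance)."""
    return len(set(x)) - 1

strings = ["", "0", "1"]
lists = [tuple(p) for n in (1, 2) for p in product(strings, repeat=n)]

for X in lists:
    assert (d(X) == 0) == (len(set(X)) == 1)   # positive definiteness
    assert d(X) == d(X[::-1])                   # symmetry (reversal as one permutation)
for X, Y, Z in product(lists, repeat=3):
    assert d(merge(X, Y)) <= d(merge(X, Z)) + d(merge(Z, Y))  # triangle inequality
print("axioms checked on", len(lists), "toy lists")
```

The triangle inequality holds for this toy distance because the intermediate list $Z$ is nonempty and is counted in both terms on the right.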

Theorem 4.1: The information distance for lists, $E_{\max}$, is a metric where the (in)equalities hold up to 
an $O(\log K)$ additive term. Here $K$ is the largest quantity involved in the metric (in)equalities. 

Proof: It is clear that $E_{\max}(X)$ satisfies positive definiteness and symmetry up to an $O(\log K)$ 
additive term where $K = K(X)$. It remains to show the triangle inequality. 

Claim 4.2: Let $X, Y, Z$ be three nonempty finite lists of finite binary strings and 
$K = \max\{K(X), K(Y), K(Z)\}$. Then, $E_{\max}(XY) \le E_{\max}(XZ) + E_{\max}(ZY)$ up to an $O(\log K)$ additive 
term. 

Proof: By Theorem 3.1, 

$$E_{\max}(XY) = \max_{x : x \in XY} K(XY|x) = K(XY|x_{XY}),$$

$$E_{\max}(XZ) = \max_{x : x \in XZ} K(XZ|x) = K(XZ|x_{XZ}),$$

$$E_{\max}(ZY) = \max_{x : x \in ZY} K(ZY|x) = K(ZY|x_{ZY}),$$

equalities up to an $O(\log K)$ additive term. Here $x_{XY}, x_{XZ}, x_{ZY}$ are the elements for which the maximum 
is reached for the respective $E_{\max}$'s. 

Assume that $x_{XY} \in X$, the case $x_{XY} \in Y$ being symmetrical. Let $z$ be some element of $Z$. Then, 

$$\begin{aligned}
K(XY|x_{XY}) &\le K(XYZ|x_{XY}) \\
&\le K(XZ|x_{XY}) + K(Y|XZ, x_{XY}) \\
&\le K(XZ|x_{XZ}) + K(Y|XZ, z) \\
&\le K(XZ|x_{XZ}) + K(ZY|x_{ZY}).
\end{aligned}$$

The first inequality follows from the general $K(u) \le K(u, v)$, the second inequality by the obvious 
subadditive property of $K(\cdot)$, the third inequality since in the first term $x_{XY} \in XZ$ and the 
$\max_{x : x \in XZ} \{K(XZ|x)\}$ is reached for $x = x_{XZ}$ and in the second term both $x_{XY} \in X$ and for $z$ 
take any element from $Z$, and the fourth inequality follows by in the second term dropping $X$ from the 
conditional and moving $Z$ from the conditional to the main argument and observing that both $z \in ZY$ 
and the $\max_{x : x \in ZY} \{K(ZY|x)\}$ is reached for $x = x_{ZY}$. The theorem follows with (in)equalities up to 
an $O(\log K)$ additive term. ■ 



V. Universality 

Let $X \in \mathcal{X}$. A priori we allow asymmetric distances. We would like to exclude degenerate distance 
measures such as $D(X) = 1$ for all $X$. For each $d$, we want only finitely many lists $X$ such that 
$D(X) < d$. Exactly how fast we want the number of lists we admit to go to $\infty$ is not important; it 
is only a matter of scaling. For every distance $D$ we require the following density condition for every 
$x \in \{0,1\}^*$: 

$$\sum_{X : x \in X \,\&\, D(X) > 0} 2^{-D(X)} \le 1. \tag{V.1}$$

Thus, for the density condition on $D$ we consider only lists $X$ with $|X| \ge 2$ and not all elements of $X$ 
are equal. Moreover, we consider only distances that are computable in some broad sense. 

Definition 5.1: An admissible list distance $D(X)$ is a total, possibly asymmetric, function from $\mathcal{X}$ to 
the nonnegative real numbers that is 0 if all elements of $X$ are equal, and greater than 0 otherwise (up 
to an $O(\log K)$ additive term with $K = K(X)$), is upper semicomputable, and satisfies the density 
requirement in (V.1). 

Theorem 5.2: The list information distance $E_{\max}(X)$ is admissible and it is minimal in the sense that 
for every admissible list distance function $D(X)$ we have $E_{\max}(X) \le D(X)$ up to an additive constant 
term. 

Proof: It is straightforward that $E_{\max}(X)$ is a total real-valued function, is 0 only if all elements 
of $X$ are equal and unequal to 0 otherwise (up to an $O(\log K)$ additive term with $K = K(X)$), and 
is upper semicomputable. We verify the density requirement of (V.1). For every $x \in \{0,1\}^*$, consider 
lists $X$ of at least two elements not all equal and $x \in X$. Define functions $f_x(X) = 2^{-K(X|x)}$. Then, 
$f_x(X) \ge 2^{-E_{\max}(X)}$. It is easy to see that for every $x \in \{0,1\}^*$, 

$$\sum_{X : x \in X \,\&\, E_{\max}(X) > 0} 2^{-E_{\max}(X)} \le \sum_{X : x \in X \,\&\, E_{\max}(X) > 0} f_x(X) \le \sum_{X : x \in X \,\&\, E_{\max}(X) > 0} 2^{-|p|},$$

where the righthand sum is taken over all programs $p$ for which the reference prefix machine $U$, given 
$x$, computes a finite list $X$ of at least two elements not all equal and such that $x \in X$. This sum is the 
probability that $U$, given $x$, computes such a list $X$ from a program $p$ generated bit by bit uniformly at 
random. Therefore, the righthand sum is at most 1, and $E_{\max}(X)$ satisfies the density requirement (V.1). 
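
The bound on the righthand sum is an instance of the Kraft inequality for prefix-free sets. The sketch below illustrates this with a hypothetical set of four "halting programs" and exact rational arithmetic; it also shows why the degenerate distance $D(X) = 1$ mentioned at the start of this section fails the density condition as soon as three lists contain $x$. The helper names are assumptions made for this example.

```python
from fractions import Fraction

def is_prefix_free(programs):
    """True if no program is a prefix of (or equal to) another.
    In sorted order a prefix is always immediately followed by an extension."""
    ps = sorted(programs)
    return all(not b.startswith(a) for a, b in zip(ps, ps[1:]))

def kraft_sum(lengths):
    """Sum of 2^(-length) over the given lengths, computed exactly."""
    return sum(Fraction(1, 2 ** n) for n in lengths)

programs = ["0", "10", "110", "111"]        # hypothetical halting programs
assert is_prefix_free(programs)
print(kraft_sum(len(p) for p in programs))  # 1

# The degenerate distance D(X) = 1 on three different lists X containing x:
print(kraft_sum([1, 1, 1]))                 # 3/2, which exceeds 1
```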

We prove minimality. Fix any $x \in \{0,1\}^*$. Since $D$ is upper semicomputable, the function $f$ defined 
by $f(X, x) = 2^{-D(X)}$ for $X$ satisfying $x \in X$ and $D(X) > 0$, and 0 otherwise, is lower semicomputable. 
Since $\sum_{X : x \in X \,\&\, D(X) > 0} 2^{-D(X)} \le 1$, we have $\sum_X f(X, x) \le 1$ for every $x$. Note that given $D$ we can 
compute $f$, and hence $K(f) \le K(D) + O(1)$. By the conditional version of (IX.2) in [28] Theorem 
4.3.2, we have $c_D \mathbf{m}(X|x) \ge f(X, x)$ for $x \in X$, with $c_D = 2^{K(f)} = 2^{K(D) + O(1)}$, that is, $c_D$ is a positive 
constant depending on $D$ only. By the conditional version of (IX.3) in [28] Theorem 4.3.4, we have for 
every $x \in X$ that $\log 1/\mathbf{m}(X|x) = K(X|x) + O(1)$. Hence, for every $x \in X$ we have 
$\log 1/f(X, x) \ge K(X|x) + \log 1/c_D + O(1)$. Altogether, for every admissible distance $D$ and every $x \in \{0,1\}^*$, and 
every list $X$ satisfying $x \in X$, there is a constant $c_D$ such that $D(X) \ge K(X|x) + \log 1/c_D + O(1)$. 
Hence, $D(X) \ge E_{\max}(X) + \log 1/c_D + O(1)$. ■ 

VI. Additivity 

Theorem 6.1: $E_{\max}$ is not subadditive: neither $E_{\max}(X) + E_{\max}(Y) \le E_{\max}(XY)$ nor 
$E_{\max}(X) + E_{\max}(Y) \ge E_{\max}(XY)$, the (in)equalities up to logarithmic additive terms, holds for all lists $X, Y$. 

Proof: Below, all (in)equalities are taken up to logarithmic additive terms. Let $x, y$ be strings of 
length $n$, $X = (\epsilon, x)$ and $Y = (\epsilon, y)$ with $\epsilon$ denoting the empty word. Then $E_{\max}(XY) = E_{\max}(\epsilon, \epsilon, x, y)$, 
$E_{\max}(X) = K(x)$, and $E_{\max}(Y) = K(y)$. If $x = y$ and $K(x) = n$, then 
$E_{\max}(XY) = E_{\max}(X) = E_{\max}(Y) = n$. Hence, $E_{\max}(XY) < E_{\max}(X) + E_{\max}(Y)$. 

Let $x, y$ be strings of length $n$ such that $K(x|y) = K(y|x) = n$, $K(x), K(y) \ge n$, $X = (x, x)$, and 
$Y = (y, y)$. Then $E_{\max}(XY) = E_{\max}(x, x, y, y) = \max\{K(x|y), K(y|x)\} = n$, $E_{\max}(X) = 0$, and 
$E_{\max}(Y) = 0$. Hence, $E_{\max}(XY) > E_{\max}(X) + E_{\max}(Y)$. ■ 

Let $X = (x)$ and $Y = (y)$. Note that subadditivity holds for lists of singleton elements since 
$E_{\max}(x, y) = \max\{K(x|y), K(y|x)\} \le K(x) + K(y)$, where the equality holds up to an additive 
$O(\log\{K(x|y), K(y|x)\})$ term and the inequality holds up to an additive constant term. 
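
Both cases of the proof can be imitated numerically. The sketch below is a heuristic illustration only: it assumes zlib compressed length as a stand-in for $K$, approximates $K(X|x)$ by $C(xX) - C(x)$, and uses pseudorandom byte strings in place of incompressible strings; the helper names are hypothetical.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes, standing in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def e_max(items) -> int:
    """Heuristic E_max: the largest C(x + X) - C(x) over the elements x of X."""
    joined = b"".join(items)
    return max(c(x + joined) - c(x) for x in items)

rng = random.Random(0)

def rand_bytes(n: int) -> bytes:
    """Pseudorandom bytes, used here in place of incompressible strings."""
    return bytes(rng.getrandbits(8) for _ in range(n))

x = rand_bytes(2000)
y = rand_bytes(2000)

# First case of the proof: X = Y = (empty word, x).
X = [b"", x]
Y = [b"", x]
XY = [b"", b"", x, x]
print(e_max(XY), "<", e_max(X) + e_max(Y))

# Second case of the proof: X = (x, x), Y = (y, y) with x and y unrelated.
X2 = [x, x]
Y2 = [y, y]
print(e_max(X2 + Y2), ">", e_max(X2) + e_max(Y2))
```

In the first case the merged list costs about as much as one copy of $x$, while the sum counts $x$ twice; in the second case each list alone is nearly free given one of its elements, but the merged list is not.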

VII. Minimal Overlap 

Let $X = (x_1, \ldots, x_m)$ and $p_i$ be a shortest program converting $x_i$ to $X$ ($1 \le i \le m$). Naively we 
expect that the shortest program that maps $x_i$ to $X$ contains the information about $X$ that is lacking 
in $x_i$. However, this is too simple, because different short programs mapping $x_i$ to $X$ may have different 
properties. 

For example, suppose $X = \{x, y\}$ and both elements are strings of length $n$ with $K(x|y), K(y|x) \ge n$. 
Let $p$ be a program that ignores the input and prints $x$. Let $q$ be a program such that $y \oplus q = x$ (that is, 
$q = x \oplus y$), where $\oplus$ denotes bitwise addition modulo 2. Then, the programs $p$ and $q$ have nothing in 
common. 
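
The role of $q$ can be seen in a two-byte example. This is a minimal sketch with arbitrary hypothetical values for $x$ and $y$; it shows only the algebra of $\oplus$, not any statement about complexities.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bitwise addition modulo 2 of two equal-length byte strings."""
    assert len(a) == len(b)
    return bytes(s ^ t for s, t in zip(a, b))

x = bytes.fromhex("a3f1")
y = bytes.fromhex("5c0e")
q = xor(x, y)           # the "program" q = x XOR y
assert xor(y, q) == x   # applying q to y recovers x
assert xor(x, q) == y   # symmetrically, applying q to x recovers y
```

Unlike $p$, which carries $x$ itself, the string $q$ converts in both directions and reveals neither $x$ nor $y$ on its own.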

Now let $x$ and $y$ be arbitrary strings of length at most $n$. Muchnik, Theorem 8.3.7 in [29], shows that 
there exists a shortest program $p$ that converts $y$ to $x$ (that is, $|p| = K(x|y)$ and $K(x|p, y) = O(\log n)$), 
such that $p$ is simple with respect to $x$ and therefore depends little on the origin $y$, that is, 
$K(p|x) = O(\log n)$. This is a fundamental coding property for individual strings that parallels related results about 
random variables known as the Slepian-Wolf and Csiszár-Körner-Marton theorems [11]. 

Theorem 7.1: Let $X = (x_1, \ldots, x_m)$ be a list of binary strings of length at most $n$. For every $x_i \in X$ 
there exists a string $p_i$ of length $K(X|x_i)$ such that $K(p_i|X) = O(\log mn)$ and 
$K(X|p_i, x_i) = O(\log mn)$. 

Proof: Muchnik's theorem as stated before gives a code p for x when y is known. There, we 
assumed that x and y have length at most n. The proof in [29] does not use any assumption about y. 
Hence we can extend the result to information distance in finite lists as follows. Suppose we encode the 
constituent list elements of $X$ self-delimitingly in altogether $mn + O(\log mn)$ bits (now $X$ takes the 
position of $x$ and we consider strings of length at most $mn + O(\log mn)$). Substitute $y$ by $x_i$ for some $i$ 
($1 \le i \le m$). Then the theorem above follows straightforwardly from Muchnik's original theorem about 
two strings of length at most n. ■ 

The code $p_i$ is not uniquely determined. For example, let $X = (x, y)$ and $z$ be a string such that 
$|x| = |y| = |z| = n$, $K(y|z) = K(z|y) \ge n$, and $x = y \oplus z$. Then, both $z$ and $y \oplus z$ can be used for 
$p$ with $K(p|X) = O(1)$ and $K(X|p, y) = O(1)$. But $z$ and $y \oplus z$ have no mutual information at all. 

Corollary 7.2: Let $X = (x_1, \ldots, x_m)$. For every string $x_i$ there is a program $p_i$ such that $U(p_i, x_i) = X$ 
($1 \le i \le m$), where $|p_i| = K(X|x_i)$, and $K(p_i) - K(p_i|p_j) = K(p_j) - K(p_j|p_i) = 0$ ($i \ne j$), and the 
last four equalities hold up to an additive $O(\log K(X))$ term. 

VIII. Normalized List Information Distance 

The quantitative difference in a certain feature between many objects can be considered as an admissible 
distance, provided it is upper semicomputable and satisfies the density condition (V.1). Theorem 5.2 shows 
that $E_{\max}$ is universal in that among all admissible list distances it is always least. That is, it 
accounts for the dominant feature in which the elements of the given list are alike. Many admissible 
distances are absolute, but if we want to express similarity, then we are more interested in relative ones. 
For example, if two strings of 1,000,000 bits have information distance 1,000 bits, then we are inclined 
to think that those strings are relatively similar. But if two strings of 1,000 bits have information distance 
1,000 bits, then we find them very different. 

Therefore, our objective is to normalize the universal information distance $E_{\max}$ to obtain a universal 
similarity distance. It should give a similarity with distance 0 when the objects in a list are maximally 
similar (that is, they are equal) and distance 1 when they are maximally dissimilar. Naturally, we desire 
the normalized version of the universal list information distance metric to be also a metric. 

For pairs of objects, say $x, y$, the normalized version $e$ of $E_{\max}$ defined by 

$$e(x, y) = \frac{E_{\max}(x, y)}{\max\{K(x), K(y)\}} = \frac{\max\{K(x, y|x), K(x, y|y)\}}{\max\{K(x), K(y)\}} \tag{VIII.1}$$

takes values in $[0, 1]$ and is a metric. Several alternatives for the normalizing factor $1/\max\{K(x), K(y)\}$ 
do not work. Dividing by the length, either the sum or the maximum, does not satisfy the triangle 
property. Dividing by $K(x, y)$ results in $e_1(x, y) = E_{\max}(x, y)/K(x, y) = 1/2$ for $|x| = |y| = n$ and 
$K(x|y) = K(y|x) \ge n$ (and hence $K(x), K(y) \ge n$), and this is improper as $e_1(x, y)$ should be 1 in this 
case. We would like a proposal for a normalization factor for lists of more than two elements to reduce 
to that of (VIII.1) for lists restricted to two elements. This leads to the proposals below, which turn out 
to be improper. 
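
For pairs, the practical counterpart of (VIII.1) is the normalized compression distance studied in [8], which replaces $K$ by the output length $C$ of a real compressor and uses $C(xy) - \min\{C(x), C(y)\}$ in the numerator. The sketch below assumes zlib as the compressor and pseudorandom bytes as test data; both are choices made for this illustration.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes, standing in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

rng = random.Random(1)
x = bytes(rng.getrandbits(8) for _ in range(2000))
y = bytes(rng.getrandbits(8) for _ in range(2000))
print(round(ncd(x, x), 2), round(ncd(x, y), 2))
```

A string compared with itself comes out close to 0 and two unrelated strings close to 1, matching the two boundary requirements stated above. Real compressors add overhead, so the values are only approximately confined to $[0, 1]$.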

As a counterexample to normalization take the following lists: $X = (x)$, $Y = (y)$, and $Z = (y, y)$. With 
$|x| = |y| = n$ and the equalities below up to an $O(\log n)$ additive term we define: 
$K(x) = K(x|y) = K(x, y|y) = K(x, y, y|y) = n$, $K(y|x) = K(y, y|x) = K(x, y, y|x) = 0.9n$, and 
$K(y) = K(y, y) = 0.9n$. Using the symmetry of information (IX.1) we have $K(x, y) = 1.9n$. Let $U, V, W$ be lists. We show 
that for the proposals below the triangle property $e(UV) \le e(UW) + e(WV)$ is violated. 

• Consider the normalized list information distance 

$$e(V) = \frac{E_{\max}(V)}{K(V_{\max})}. \tag{VIII.2}$$

That is, we divide $E_{\max}(V)$ by $K(V_{\max})$ with $V_{\max} = \max_i\{V_i\}$ where the list $V_i$ equals the list $V$ 
with the $i$th element deleted ($1 \le i \le |V|$). Then, with equalities holding up to $O((\log n)/n)$ we have: 
$e(XY) = K(x, y|y)/K(x) = 1$, $e(XZ) = E_{\max}(XZ)/K(XZ_{\max}) = K(x, y, y|y)/K(x, y) = 1/1.9$, 
and $e(ZY) = E_{\max}(ZY)/K(ZY_{\max}) = K(y, y, y|y)/K(y, y) = 0$. Hence the triangle inequality 
does not hold. 

• Instead of dividing by $K(V_{\max})$ in (VIII.2) divide by $K(V')$ where $V'$ equals $V$ with $x_V$ deleted. 
The same counterexample to the triangle inequality holds. 

• Instead of dividing by $K(V_{\max})$ in (VIII.2) divide by $K(\{V_{\max}\})$ where $\{V_{\max}\}$ is the set of elements 
in $V_{\max}$. To equate the sets approximately with the corresponding lists, change $Z$ to $\{y_1, y_2\}$ where 
$y_i$ equals $y$ but with the $i$th bit flipped ($i = 1, 2$). Again, the triangle inequality does not hold. 

• Instead of dividing by $K(V')$ in (VIII.2) divide by $K(\{V'\})$ where $\{V'\}$ is the set of elements in 
$V'$. Change $Z$ as in the previous item. Again, the triangle inequality does not hold. 
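
The arithmetic behind the first proposal can be replayed exactly. The sketch below works in units of $n$, ignores the logarithmic terms, and takes the complexities directly from the counterexample; the variable names are hypothetical and only the first of the four proposals is covered.

```python
from fractions import Fraction as F

# Complexities from the counterexample, in units of n (log terms ignored).
K_x, K_xy, K_yy = F(1), F(19, 10), F(9, 10)

E_XY = F(1)   # max{K(x,y|x), K(x,y|y)} = max{0.9, 1}
E_XZ = F(1)   # max{K(x,y,y|x), K(x,y,y|y)} = max{0.9, 1}
E_ZY = F(0)   # K(y,y,y|y) = 0

e_XY = E_XY / K_x    # V_max for (x, y) is (x)
e_XZ = E_XZ / K_xy   # V_max for (x, y, y) is (x, y)
e_ZY = E_ZY / K_yy   # V_max for (y, y, y) is (y, y)

print(e_XY, e_XZ, e_ZY)      # 1 10/19 0
print(e_XY <= e_XZ + e_ZY)   # False: the triangle property fails
```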

IX. Appendix: Kolmogorov Complexity Theory 

Theory and applications are given in the textbook [28]. Here we give some relations that are needed in 
the paper. The information about $x$ contained in $y$ is defined as $I(y : x) = K(x) - K(x|y)$. A deep, and 
very useful, result due to L.A. Levin and A.N. Kolmogorov [44] called symmetry of information shows 
that 

$$K(x, y) = K(x) + K(y|x) = K(y) + K(x|y), \tag{IX.1}$$

with the equalities holding up to $\log K$ additive precision. Here, $K = \max\{K(x), K(y)\}$. Hence, up to 
an additive logarithmic term $I(x : y) = I(y : x)$ and we call this the mutual (algorithmic) information 
between $x$ and $y$. 
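
The step from symmetry of information to the symmetry of $I$ is short; spelled out, with every equality holding up to an additive logarithmic term, it reads:

```latex
\begin{align*}
I(y : x) &= K(x) - K(x|y) \\
         &= K(x) - \bigl(K(x, y) - K(y)\bigr) \\
         &= K(y) - \bigl(K(x, y) - K(x)\bigr) \\
         &= K(y) - K(y|x) = I(x : y).
\end{align*}
```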

The universal a priori probability of $x$ is $Q_U(x) = \sum_{U(p) = x} 2^{-|p|}$. The following results are due to 
L.A. Levin [24]. 

There exists a lower semicomputable function $\mathbf{m} : \{0,1\}^* \to [0,1]$ with $\sum_x \mathbf{m}(x) \le 1$, such that for 
every lower semicomputable function $P : \{0,1\}^* \to [0,1]$ with $\sum_x P(x) \le 1$ we have 

$$2^{K(P)} \mathbf{m}(x) \ge P(x), \tag{IX.2}$$

for every $x$. Here $K(P)$ is the length of a shortest program for the reference universal prefix Turing 
machine to lower semicompute the function $P$. For every $x \in \{0,1\}^*$, 

$$K(x) = -\log Q_U(x) = -\log \mathbf{m}(x) \tag{IX.3}$$

with equality up to an additive constant independent of $x$. Thus, the Kolmogorov complexity of a string $x$ 
coincides up to an additive constant term with the logarithm of $1/Q_U(x)$ and also with the logarithm of 
$1/\mathbf{m}(x)$. This result is called the "Coding Theorem" since it shows that the shortest upper semicomputable 
code is a Shannon-Fano code of the greatest lower semicomputable probability mass function. 